1 Introduction

Pericytes are a type of cell that enwraps small blood vessels throughout all vascularized tissues of the body. Pericytes are thought to be a key effector cell that modulate blood vessel behavior and can profoundly influence clinical outcomes in chronic disease. However, pericytes are a difficult cell type to study because they lack exclusive cellular markers, making identification and isolation difficult. Currently, pericytes are identified by a panel of markers, their cell shape, and their perivascular position relative to capillaries (Fig 1). Recently, a study by the Yuan Lab [1] at Harvard Medical School sought to identify novel pericyte-specific markers using single cell RNA-seq from the Tabula Muris Atlas. The Tabula Muris atlas is a compendium of single cell transcriptome data from the model organism Mus musculus, containing nearly 100,000 cells from 20 organs and tissues [2].

Figure 1: Diagram of various cell types found in vascularized tissues. Smooth muscle cells, pericytes, fibroblasts, and mesenchymal stem cells all share similiar marker expression and are often defined by their position relative to the vasculature and cell shape.
Figure 1: Diagram of various cell types found in vascularized tissues. Smooth muscle cells, pericytes, fibroblasts, and mesenchymal stem cells all share similiar marker expression and are often defined by their position relative to the vasculature and cell shape.


The authors used UMAP to visualize gene expression from clustering with a KNN graph based on Euclidean distance generated with linear dimension reduction with PCA. Pericyte associated clusters were identified with high co-expression of two pericyte markers, CSPG4 and PDGFRB. A subclass of cells in this group with exceptionally high co-expression were identified as “stringent” pericytes and examined further. Differential gene expression analysis of stringent pericytes against all other cell types yielded a panel of novel candidate pericyte markers. While this study identified interesting new potential pericyte markers for specific research, there are potential issues with the methodology used for the processing pipeline that will yield a better understanding of the results:

  1. Data Integration: the authors combined datasets from various tissues into a single dataset using the IntegrateData() function in Suerat. Data integration is useful for eliminating batch effects and other uninteresting sources of variation when combining data across replicates of the same tissue (horizontal integration), or combining measurements from different technologies of the same cells (vertical integration), this study combines data from different organs. Such an integration process is not required for identifying cell markers across tissues- markers can be identified for each tissue individually without any integration. There is a possibility that using integration could drown out signal in tissues where the overall expression of the cell markers of interest is less than average.
  2. Clustering Effectiveness: the authors do not determine if their clustering analysis was effective in separating distinct cell populations within the tissue. Cluster can be evaluated based on their composition of cell types using annotations provided from the Tabula Muris atlas
  3. Identity of Pericyte Cluster: the cluster that had high expression of CSPG4 and PDGFRB markers were designated as pericytes for this study, but these markers are also expressed in other cell types (1-3).

These queries motivate the following research questions:
1. Will the same overall trends in clustering still appear if the datasets are not integrated across tissue types?
2. Does the clustering approach used in the study effectively delineate between the manually annotated cell types in the database?
3. Do the pericyte designated clusters align with the manual annotations that already exist in the database?


Primary Paper and Dataset References
[1] S.-H. Baek, E. Maiorino, H. Kim, K. Glass, B. A. Raby, K. Yuan, Single Cell Transcriptomic Analysis Reveals Organ Specific Pericyte Markers and Identities. Frontiers in Cardiovascular Medicine. 9 (2022) (available at https://www.frontiersin.org/articles/10.3389/fcvm.2022.876591).
[2] The Tabula Muris Consortium., Overall coordination., Logistical coordination. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018). https://doi.org/10.1038/s41586-018-0590-4.

Primary Packages: Seurat, AnnotationHub, ggplot, tidyverse, pheatmap.
Supporting Packages: cowplot, RColorBrewer, BiocManager, stringr

>> Code for loading packages and initializing parallel processing.

# Note: this code assumes that the working directory is set to the source file 
# location

# Install required Base Packages
base_packages <- c("Seurat", "ggplot2", "tidyverse", "cowplot", "RColorBrewer",
                   "BiocManager", "stringr", "pheatmap", "future", "rmdformats",
                   "memuse")
install.packages(setdiff(base_packages, rownames(installed.packages())))  
# Install required Bioconductor Packages
biocm_packages <- c("AnnotationHub", "ensembldb")
bioc_installs <- setdiff(biocm_packages, rownames(installed.packages()))
if (length(bioc_installs)) {BiocManager::install(bioc_installs) }

# Load base packages
lapply(base_packages, library, character.only = TRUE)
# Load Bioconductor packages packages
lapply(biocm_packages, library, character.only = TRUE)

# Setup multicore processing
plan(multisession, workers = max(c(parallelly::availableCores()-2,1)))
# Set max ram to 90% of available
options(future.globals.maxSize = as.numeric(memuse::Sys.meminfo()[[2]]) *.9)

2 Methods

For the purposes of clarity and reproducibility, below are summaries of the processing pipeline used in this project, along with an explanation of the gene annotation database and supporting files necessary to carry out the analysis.

2.1 Pipeline


Figure 2.1: Porcessing of pipeline for this tudy. Steps bridged by white arrows are considered standard workflow for seurant, while grey arrows are custom analysis for this projects.
Figure 2.1: Porcessing of pipeline for this tudy. Steps bridged by white arrows are considered standard workflow for seurant, while grey arrows are custom analysis for this projects.


Below are function parameters that were used in the reference publication for processing. 1. NormalizeData(normalization.method = “LogNormalize”, scale.factor = 10000)
2. FindVariableFeatures(selection.method = “vst”, nfeatures = 2000)
3. ScaleData(model.use = “linear”,)
4. RunPCA()
5. Dimensionality: Bladder: 30, Kidney: 45, Lung and Heart: 45 5. FindNeighbors(dims = 1:dimensionality)
6. FindClusters(resolution = 0.5, algorithm=1)
7. RunUMAP()


2.2 Gene Annotation


Figure 2.2: Conversion of ensemble IDs to gene names where done from the latest annotation files from Mus musculus in the AnnotateHub Bioconductor package.
Figure 2.2: Conversion of ensemble IDs to gene names where done from the latest annotation files from Mus musculus in the AnnotateHub Bioconductor package.




Seurat single cell datasets typically record gene expression by their ensemble ID which use unique static codes to identify features in the genome. However, there are not human readable, so we must convert ensemble IDs to the gene name with annotated gene database. Pipelines typically use bioMart or AnnotationHub as a reference database. AnnotationHub was used because it maintains a 1:1 mapping from ensemble ID’s to gene names, while bioMart does not because it has a one to many relationship between ensemble IDs and entrez IDs.

Code for annotating genes with gene name.

# Check for gene annotation file
if (!file.exists(paste0(getwd(),"/data/annotations_Mm.csv"))) {
# Connect to Annotation Hub as the source for gene annotations
ah <- AnnotationHub()
# Access the Ensemble database for organism
ahDb <- query(ah, pattern = c("Mus musculus", "EnsDb"),    ignore.case = TRUE)
# Acquire the latest annotation files
id <- ahDb %>% mcols() %>% rownames() %>% tail(n = 1)
# Download the appropriate Ensembldb database
edb <- ah[[id]]

# Extract gene-level information from database
annotations <- genes(edb, return.type = "data.frame")
# Select annotations of interest
annotations <- annotations %>%
  dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)

# Export table to csv
data.table::fwrite(annotations, file = paste0(getwd(),"/data/annotations_Mm.csv"))
}
# Load ensembl <-> gene annotation conversion table
gene_annotations <- data.table::fread(file = paste0(getwd(),"/data/annotations_Mm.csv"),
                                      header = TRUE)

2.3 Datafiles


Figure 2.3: Tabula Muris Atlas.

Seurat datafiles provided by the Tabula Muris were used in this analysis for bladder, heart, kindey, and lung tissues.

Direct Link to Seurat datafiles found on Figshare.

Data sets were renamed to [tissue].rds and saved in the /data/ subfolder of the project repository.

  1. bladder.rds
    348 MB, MD5 checksum: 12349521850bed75aaba9a4abb5daf48

  2. heart.rds
    81 MB, MD5 checksum: f4740230d092fe5c072a3eacb5f533e6

  3. kidney.rds
    378 MB, MD5 checksum: afae3d5895401126c51090f5d6f9299a

  4. lung.rds
    752 MB,MD5 checksum: f6a12d88b19e7ecd94256c0f0f0ed284


2.4 Code

>> Code for processing Seurat pipeline.

# Get dataset filenames, assumed to be named by tissue type
data_filenames <- str_replace(list.files(paste0(getwd(),'/data/'),pattern='rds',
                                         full.names = F), '.rds','')[c(4,2,3,1)]


# Specify dimensions for linear dimensional reduction (authors used in paper)
dim_thresh <- c(30,35,45,35)
# SPecify clusters of interest for each tissue
clusters_oi <- list(13,c(11,10),19,24)

# Load all Seurat objects into list
seurat_objs <- list()

# Process and cluster each tissue data set individually (do not integrate)
if (!file.exists(paste0(getwd(),"/results/seurat_objs.rdata"))) {
  for(x in seq_along(data_filenames)){
    # Load seurat object
    seurat_obj  <- readRDS(paste0(getwd(),'/data/', data_filenames[x], '.rds'))

    # Create cleaned cell type labels
    clean_labels <- function(x) 
      str_replace_all(str_replace_all(x, paste0(data_filenames[x]," "),""), 
                      c("cell "="", "^nan$" = "unknown"))
    seurat_obj@meta.data$cell_label <- 
      factor(str_to_title(clean_labels(seurat_obj@meta.data$free_annotation)))
    
    # Normalize expression by fraction of total, * 10,000, and log-transform
    seurat_objs[[x]] <- NormalizeData(object = seurat_obj)
    
    # Calculates z-score of gene expression dispersion within dataset (binned by their dispersion)
    seurat_objs[[x]] <- FindVariableFeatures(object = seurat_objs[[x]])
    
    # Eliminate uninteresting sources of variation (batch effects, technical noise, cell cycle)
    # By regressing out these signals with linear models and outputting the residuals
    seurat_objs[[x]] <- ScaleData(object = seurat_objs[[x]])
    
    # Perform PCA on residuals to compress dataset and reduce noise from inconsequential genes
    seurat_objs[[x]] <- RunPCA(object = seurat_objs[[x]])
    
    # Automated alternative for determining number of PCs to include
    # JackStrawPlot(object = seurat_objs[[x]], PCs = 1:50)
    
    # Construct KNN graph refined with Jaccard similarity between local neighborhoods
    seurat_objs[[x]] <- FindNeighbors(object = seurat_objs[[x]], dims = 1:dim_thresh[x])
    # Cluster cells with Louvain algorithm (default) or SLM
    seurat_objs[[x]] <- FindClusters(object = seurat_objs[[x]], resolution = 0.5)
    # Run non-linear dimension reduction to reveal underlying manifold of the data
    seurat_objs[[x]] <- RunUMAP(object = seurat_objs[[x]], dims = 1:dim_thresh[x])
 

  }
  # Export seurat object
  names(seurat_objs) <- data_filenames
  save(seurat_objs, file = paste0(getwd(),"/results/seurat_objs.rdata"))
  
} else {
  # If seurat datafiles already computed, load from disk
  load(paste0(getwd(),"/results/seurat_objs.rdata"))
  
}

>> Code for visualizations.

# Visualizations of dimensions, clusters, and gene expression
# Load all Seurat objects into lists
gg_elbows <- list()
gg_umaps <- list()
gg_dots <- list()
gg_cows <- list()
gg_annot_umap = list()
gg_annot = list()
gg_annot_table= list()
# Visualizations
for(x in seq_along(data_filenames)){
  
  # Convert cell type labels 
  remove_list = c("Epcam","Pecam", data_filenames[x], "cell")
  # 
  clean_labels = function(k) str_replace_all(str_replace_all(k, paste(remove_list,collapse="|"),""),
                                             c("\\s+"=" ", "^nan$" = "unknown"))
  seurat_objs[[x]]@meta.data$cell_label <-
    factor(str_to_title(str_trim(clean_labels(seurat_objs[[x]]@meta.data$free_annotation))))
  
  # Determine number of PCs to use
  gg_elbows[[x]] <- ElbowPlot(seurat_objs[[x]], ndims = 50, reduction = "pca") + 
    ggtitle(tools::toTitleCase(data_filenames[x]), ) +
    geom_vline(xintercept = dim_thresh[x], color = "red") +
    theme(plot.title = element_text(hjust = 0.5))
  save_plot(paste0(getwd(),'/results/elbow_', data_filenames[x], '.png'),
            gg_elbows[[x]], base_width = 3, base_height = 3)
  
    
  # Visualize clusters
  gg_temp <- DimPlot(seurat_objs[[x]], reduction = 'umap', label = FALSE, repel = TRUE) +
    ggtitle(sprintf("%s: %s Total Cells", tools::toTitleCase(data_filenames[x]), 
                    format(dim(seurat_objs[[x]])[2],big.mark=",", trim=TRUE)))  +
    theme_void() + theme(legend.position="none",plot.title = element_text(size=12,hjust = 0.5))
  gg_umaps[[x]] <- LabelClusters(gg_temp, id = "ident",  fontface = "bold", size = 6)
  
  
  # Create a dot plot of expression of PC markers
  gg_dots[[x]] <- DotPlot(seurat_objs[[x]], features = c("ENSMUSG00000032911","ENSMUSG00000024620")) + 
    scale_x_discrete(labels=c('CSPG4', 'PDFRB'),position = "top")  + xlab("") + ylab("Cluster") +
    scale_y_discrete(position = "right") + #scale_color_gradient(title) +
    scale_size_area(breaks=c(25,50,75), limits = c(0, 75)) +
    scale_color_gradient(breaks = rev(c(0,1,2)), limits = c(0, 2.5), low = "#D1CFD4", high = "#1E0BFE")+
    guides(size=guide_legend(title="% Exp."), colour = guide_colourbar(title="Mean Exp.")) +
    theme(text=element_text(size=12),plot.margin = unit(c(0, 0, 0, 0), "cm"),
          axis.text.x = element_text(size =12,angle = 45, hjust =0), axis.text.y = element_text(size =12),
          axis.ticks=element_blank(),panel.grid.major=element_blank(),
          legend.position = "none",
          panel.grid.minor = element_blank())
  
  # Create grid of cluster and expresison table, then add legend at bottom
  gg_cluster_expr <- plot_grid(gg_umaps[[x]], gg_dots[[x]], ncol = 2, rel_widths = c(3, 1))
  # Add a second row with a legend
  legend <- get_legend(gg_dots[[x]]+theme(legend.position = "bottom",legend.justification="center"))
  gg_cows[[x]] <- plot_grid(gg_cluster_expr, legend, nrow = 2, rel_heights = c(1, .1))
  save_plot(paste0(getwd(),'/results/umap_organ_', data_filenames[x], '.png'),
            gg_cows[[x]], base_width = 8, base_height = 6)
  
  
  # Visualize Cluster with cell type annotated by atlas
  gg_annot_umap[[x]] <- DimPlot(seurat_objs[[x]], reduction = 'umap', group.by = "cell_label", 
                                label = FALSE, repel = TRUE) +
    ggtitle(sprintf("%s: %s Total Cells", tools::toTitleCase(data_filenames[x]), 
                    format(dim(seurat_objs[[x]])[2],big.mark=",", trim=TRUE)))  +
    theme(legend.position="bottom",text = element_text(size = 9),
          plot.title = element_text(size=12,hjust = 0.5), legend.key.size = unit(0, 'cm'), 
          legend.spacing = unit(0.1, 'cm'), legend.margin = unit(0.1, 'cm')) +
    guides(color=guide_legend(nrow=c(8,3,5,3)[x], byrow=TRUE, override.aes = list(size=3))) 
  gg_annot_umap[[x]]
  
  
  # Calculate percentage of cell type per cluster
  tab <- table(Assigned=seurat_objs[[x]]@meta.data$cell_label, Clusters=seurat_objs[[x]]$seurat_clusters)
  perc_tab <- sweep(tab, 2, colSums(tab ), FUN = '/')
  gg_annot_table[[x]] <- pheatmap(perc_tab, color = colorRampPalette(c('white','blue'))(10),
                                  cluster_rows = F, cluster_cols = F, fontsize = 9)
  
  gg_annot[[x]] <- plot_grid(gg_annot_umap[[x]], gg_annot_table[[x]][[4]], nrow = 2, rel_widths = c(1, 1))
  
  save_plot(paste0(getwd(),'/results/umap_annot_', data_filenames[x], '.png'),
            gg_annot[[x]], base_width = 2*6, base_height = 2*6)
  
} 

2.5 Dimensionality

A significance step in processing higher dimensional data is the linear reduction that compresses datapoints into a series of hyperplanes that comprise a linear combination of the features in the dataset. This data used the same dimensional thresholds that the authors used in the original study because the thresholds seemed reasonable as visualized with an elbow plot for each tissue.

2.5.1 Lung


Figure 2.4.1: Standard deviation captured by increasing dimensions of linear reduction of the lung gene expression data with PCA.Authors set the dimensionality for the bladder dataset from 1 to 30 dimensions.


2.5.2 Heart


Figure 2.4.2: Standard deviation captured by increasing dimensions of linear reduction of the heart gene expression data with PCA.Authors set the dimensionality for the heart dataset from 1 to 35 dimensions.


2.5.3 Kidney


Figure 2.4.3: Standard deviation captured by increasing dimensions of linear reduction of the kidney gene expression data with PCA.Authors set the dimensionality for the kidney dataset from 1 to 45 dimensions.


2.5.4 Bladder


Figure 2.4.4: Standard deviation captured by increasing dimensions of linear reduction of the bladder gene expression data with PCA.Authors set the dimensionality for the lung dataset from 1 to 35 dimensions.


3 Results

To investigate effect of leaving data unintegrated for pericyte marker analysis, the data was reprecossed in Seurat following the standard workflow (Fig 2.1). The same settings were used as the original study, including the dimensional chosen for each dataset (meaning the PCA principal components that were used for linear dimension reduction and subsequent nonlinear clustering, Fig 2.4.1-4). UMAP clustering was used to visualize the data with average expression of pericyte markers CPSG4 and PDGFRB visualized, along with a heatmap of the cell type composition of each cluster according to annotations from the Tabula Muris atlas.

3.1 Lung

Figure 3.2.1A: UMAP clustering of gene expression from cells isolated from lung issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 156 cells comprising cluster 13 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 284 cells from cluster 16 in the study).

Figure 3.2.1B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.


3.2 Heart

Figure 3.2.2A: UMAP clustering of gene expression from cells isolated from heart issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 394 cells comprising clusters 10 and 11 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 392 cells from clusters 10 and 12 in the study).

Figure 3.2.2B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.


3.3 Kidney

Figure 3.2.3A: UMAP clustering of gene expression from cells isolated from kidney issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 229 cells comprising cluster 19 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 358 cells from cluster 17 in the study).

Figure 3.2.3B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.


3.4 Bladder

Figure 3.2.4A: UMAP clustering of gene expression from cells isolated from lung issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 62 cells comprising cluster 24 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 345 cells from clusters 10 and 11 in the study).

Figure 3.2.4B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.


4 Discussion

Overall, the clustering approach without data integration yielded similar pericyte like clusters as those found in the original study. However, for all tissues, there appeared to be a higher fraction of cells expressing both pericyte clusters with this modified clustering approach, suggesting that avoiding integration resulted in more homogenous and purified clusters. Compared to the original study, about 10% of cells in lung tissue expressed both markers (compared to 35%), 25% of cells in heart tissue (compared to 50%), 30% of cells in kidney tissue (compared to 50%), and 10% of cells in bladder (compared to 25%).

When compared to the cell type annotation provided by the Tabula Muris database, the unsupervised clustering performed reasonably well in separating them into distinct groups (Figure 3.2.[1-4]B). For lung tissue, clusters appeared to be split between cell types that have a high degree of similarity, such as T cell types or myeloid dendritic cells (Fig 3.2.1B). Similar trends were observed with heart tissue, although there were a multitude of cell types with unknown cellular origins for this tissue type (Fig 3.2.2B). For kidney tissue, overlap was mostly seen for immune cells and endothelial cells (Fig 3.2.3B). For bladder tissue, there was overlap with epithelial cell types, but otherwise most clusters did not map to multiple cell types (Fig 3.2.4B).

The cluster of cells that had high expression of CSPG4 and PDGFRB and deemed to be a source of pericytes did not always match their annotated cell type from the atlas. For lungs, the pericyte cluster were labeled as pericytes, but were labeled as smooth muscle cells in the heart, stroma mesangial cells in the kidney, and endothelial cells in the bladder. These discrepancies will require further follow up with the literature and more detailed marker expression of these groupings, and also highlights the difficulty in determining clean groupings for cell type.

Future Questions
1. Does that marker expression of the stringent pericyte cluster contain various pericyte markers shown in the literature?
2. Are there cells with marker expression profiles that match pericytes found in other clusters?
3. Will the same results be observed if the data is reprocessed with recent advances in transcriptomic analysis (3)?

5 References

  1. Riew TR, Jin X, Kim S, Kim HL, Lee MY. Temporal dynamics of cells expressing CSPG4 and platelet-derived growth factor receptor-β in the fibrotic scar formation after 3-nitropropionic acid-induced acute brain injury. Cell Tissue Res. 2021 Sep;385(3):539-555. doi: 10.1007/s00441-021-03438-3. Epub 2021 Apr 17. PMID: 33864501.
  2. Z. Jiang, T. Feng, Z. Lu, Y. Wei, J. Meng, C.-P. Lin, B. Zhou, C. Liu, H. Zhang, PDGFRb+ mesenchymal cells, but not CSPG4+ mural cells, contribute to cardiac fat. Cell Reports. 34, 108697 (2021).
  3. Allan-Hermann Pool et al, Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references, Nature Methods (2023). DOI: 10.1038/s41592-023-02003-w



©2023 Bruce Corliss